expansion(kRSGeneric): Fix double apostrophe values with kRSUnicode #304

Merged
tony merged 3 commits into master from fix-kRSUnicode on Dec 10, 2023

Conversation

tony
Member

@tony tony commented Dec 10, 2023

Changes

  • Expansion(kRSGeneric): Fix parsing of double-apostrophe (non-Chinese simplified) radicals

Context

Without this fix, unihan-db's bootstrap example fails while expanding kRSUnicode values that carry a double apostrophe, such as 213''.0:

Error
~/cihai/cihai/unihan-db/tests/test_example.py:45: in test_01_bootstrap
    example.run()
        example    = <module 'run' from '~/cihai/cihai/unihan-db/examples/01_bootstrap.py'>
        project_root = PosixPath('~/cihai/cihai/unihan-db')
        unihan_options = {'source': PosixPath('/tmp/pytest-of-t/pytest-105/test_01_bootstrap0/Unihan.zip'), 'work_dir': PosixPath('/tmp/pytest-...st-105/test_01_bootstrap0'), 'zip_path': PosixPath('/tmp/pytest-of-t/pytest-105/test_01_bootstrap0/downloads/Moo.zip')}
~/cihai/cihai/unihan-db/examples/01_bootstrap.py:15: in run
    bootstrap.bootstrap_unihan(session)
        session    = <sqlalchemy.orm.scoping.scoped_session object at 0x7fc953a4e000>
        unihan_options = None
~/cihai/cihai/unihan-db/src/unihan_db/bootstrap.py:172: in bootstrap_unihan
    data = bootstrap_data(_options)
        _options   = {}
        options    = None
        session    = <sqlalchemy.orm.scoping.scoped_session object at 0x7fc953a4e000>
~/cihai/cihai/unihan-db/src/unihan_db/bootstrap.py:160: in bootstrap_data
    return p.export()
        _options   = {'expand': True, 'fields': ['kAccountingNumeric', 'kCangjie', 'kCantonese', 'kCheungBauer', 'kCihaiT', 'kCompatibility...t', 'Unihan_IRGSources.txt', 'Unihan_NumericValues.txt', 'Unihan_RadicalStrokeCounts.txt', 'Unihan_Readings.txt', ...)}
        options    = {}
        p          = <unihan_etl.core.Packager object at 0x7fc953a4da90>
~/.virtualenvsunihan-db-tpeMWywt-py3.12/lib/python3.12/site-packages/unihan_etl/core.py:575: in export
    data = expand_delimiters(data)
        data       = [{'char': '㐀', 'kAccountingNumeric': None, 'kCCCII': None, 'kCangjie': 'TM', ...}, {'char': '㐁', 'kAccountingNumeric':...I': None, 'kCangjie': 'JV', ...}, {'char': '㐅', 'kAccountingNumeric': None, 'kCCCII': None, 'kCangjie': 'K', ...}, ...]
        fields     = ['char', 'ucn', 'kAccountingNumeric', 'kCangjie', 'kCantonese', 'kCheungBauer', ...]
        files      = [PosixPath('unihan_etl/downloads/Unihan_DictionaryIndices.txt'), PosixPath('unihan_etl/d.../downloads/Unihan_RadicalStrokeCounts.txt'), PosixPath('unihan_etl/downloads/Unihan_Readings.txt'), ...]
        k          = 'char'
        raw_data   = <fileinput.FileInput object at 0x7fc9539f82c0>
        self       = <unihan_etl.core.Packager object at 0x7fc953a4da90>
~/.virtualenvsunihan-db-tpeMWywt-py3.12/lib/python3.12/site-packages/unihan_etl/core.py:409: in expand_delimiters
    char[field] = expansion.expand_field(field, char[field])
        char       = {'char': '亀', 'kAccountingNumeric': None, 'kCCCII': '2D632D', 'kCangjie': 'NWLU', ...}
        field      = 'kRSUnicode'
        normalized_data = [{'char': '㐀', 'kAccountingNumeric': None, 'kCCCII': None, 'kCangjie': 'TM', ...}, {'char': '㐁', 'kAccountingNumeric':...I': None, 'kCangjie': 'JV', ...}, {'char': '㐅', 'kAccountingNumeric': None, 'kCCCII': None, 'kCangjie': 'K', ...}, ...]
~/.virtualenvsunihan-db-tpeMWywt-py3.12/lib/python3.12/site-packages/unihan_etl/expansion.py:716: in expand_field
    return expansion_func(fvalue)
        expansion_func = <function _expand_kRSGeneric at 0x7fc955538fe0>
        field      = 'kRSUnicode'
        fvalue     = ['5.10', "213''.0"]
~/.virtualenvsunihan-db-tpeMWywt-py3.12/lib/python3.12/site-packages/unihan_etl/expansion.py:591: in _expand_kRSGeneric
    assert m is not None
E   assert None is not None
        expanded   = [{'radical': 5, 'simplified': False, 'strokes': 10}, "213''.0"]
        g          = {'radical': '5', 'simplified': '', 'strokes': '10'}
        i          = 1
        m          = None
        pattern    = re.compile("\n        (?P<radical>[1-9][0-9]{0,2})\n        (?P<simplified>\\'?)\\.\n        (?P<strokes>-?[0-9]{1,2})\n    ", re.VERBOSE)
        v          = "213''.0"
        value      = ['5.10', "213''.0"]
-----------------------------------------
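The failing assertion can be reproduced in isolation. The sketch below copies the pattern shown in the traceback and applies it to the two values of `fvalue`; it assumes the expansion code calls `pattern.match`, which the traceback does not show, and the variable names `single` and `double` are illustrative only.

```python
import re

# Pattern copied from the traceback: the "simplified" group allows
# at most ONE apostrophe after the radical number.
pattern = re.compile(
    r"""
    (?P<radical>[1-9][0-9]{0,2})
    (?P<simplified>\'?)\.
    (?P<strokes>-?[0-9]{1,2})
    """,
    re.VERBOSE,
)

# First value from the traceback: no apostrophe, so it matches.
single = pattern.match("5.10")
print(single.groupdict())

# Second value: two apostrophes. After consuming one apostrophe the
# pattern expects a literal dot, finds another apostrophe, and fails.
double = pattern.match("213''.0")
print(double)  # None, which is what trips `assert m is not None`
```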

Appendix

kRSUnicode

via https://www.unicode.org/reports/tr38/#kRSUnicode:

The standard radical-stroke count for this ideograph in the form “radical.additional strokes.” The radical is indicated by a number in the range 1–214, followed by an optional single apostrophe (U+0027 ' apostrophe) or double apostrophe ('') suffix. A single apostrophe after the radical indicates a Chinese simplified version of the given radical. Two apostrophes after the radical indicates a non-Chinese simplified version of the given radical. The “additional strokes” value is the residual stroke-count, the count of all strokes remaining after eliminating all strokes associated with the radical.
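As a rough illustration of the format described above, here is a minimal sketch of a parser that accepts zero, one, or two apostrophes after the radical. It is not necessarily the change merged in this pull request: `RS_PATTERN`, `parse_rs`, and the output keys `chinese_simplified` and `non_chinese_simplified` are hypothetical names chosen for this example.

```python
import re

# Hypothetical pattern: like the one in the traceback, but the
# apostrophe group accepts up to two apostrophes.
RS_PATTERN = re.compile(
    r"""
    (?P<radical>[1-9][0-9]{0,2})   # radical number (range 1-214 is not enforced here)
    (?P<apostrophes>'{0,2})\.      # optional single or double apostrophe, then the dot
    (?P<strokes>-?[0-9]{1,2})      # additional (residual) stroke count
    """,
    re.VERBOSE,
)


def parse_rs(value: str) -> dict:
    """Split one radical-stroke value into its parts (illustrative only)."""
    m = RS_PATTERN.fullmatch(value)
    if m is None:
        raise ValueError(f"not a radical-stroke value: {value!r}")
    marks = len(m.group("apostrophes"))
    return {
        "radical": int(m.group("radical")),
        "strokes": int(m.group("strokes")),
        # One apostrophe: Chinese simplified form of the radical.
        "chinese_simplified": marks == 1,
        # Two apostrophes: non-Chinese simplified form of the radical.
        "non_chinese_simplified": marks == 2,
    }


print(parse_rs("5.10"))
print(parse_rs("213''.0"))
```

Keeping the two simplified variants as separate booleans is only one possible design; a single field holding the apostrophe count would carry the same information.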


codecov bot commented Dec 10, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f4f9f16) 67.79% compared to head (01f5cb1) 67.79%.
Report is 1 commits behind head on master.

❗ Current head 01f5cb1 differs from pull request most recent head 7a219a4. Consider uploading reports for the commit 7a219a4 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #304   +/-   ##
=======================================
  Coverage   67.79%   67.79%           
=======================================
  Files          12       12           
  Lines        1149     1149           
  Branches      215      215           
=======================================
  Hits          779      779           
  Misses        317      317           
  Partials       53       53           


@tony tony marked this pull request as ready for review December 10, 2023 13:10
…ied radical)

@tony tony merged commit f21e437 into master Dec 10, 2023
10 checks passed
@tony tony deleted the fix-kRSUnicode branch December 10, 2023 13:30